feat(generators): add --deterministic flag with hybrid RDFC-1.0 + rdflib serialization by jdsika · Pull Request #1 · ASCS-eV/linkml

jdsika · 2026-03-25T15:19:20Z

Summary

Add a --deterministic flag to OWL, SHACL, and JSON-LD generators that produces byte-identical output across invocations, eliminating spurious diffs in version-controlled artifacts.

This is a review-ready fork of the approach discussed in upstream linkml/linkml#3295, rebuilt to address maintainer feedback.

Problem

Generated OWL and SHACL artifacts contain blank nodes whose identifiers change between runs due to Python dict ordering and rdflib serialization non-determinism. This makes version-controlled artifacts show massive diffs even when the underlying schema change is trivial.

Solution

Three-Phase Hybrid Pipeline (`deterministic_turtle()`)

1. RDFC-1.0 canonicalization (W3C Recommendation) via pyoxigraph ensures isomorphic inputs produce identical triple sets.
2. Weisfeiler-Lehman structural hashing replaces sequential _:c14nN identifiers with content-based hashes. These depend only on predicate IRIs, literal values, and named-node IRIs, not on blank-node numbering, so adding or removing a triple only affects directly involved blank nodes.
3. Hybrid rdflib re-serialization parses the canonicalized, WL-hashed triples back into an rdflib Graph and serializes with rdflib's native Turtle writer. This recovers idiomatic Turtle features that pyoxigraph cannot emit:
- Inline blank nodes ([ … ]) for singly-referenced blank nodes (Turtle §2.7)
- Collection syntax (( … )) for rdf:List chains (Turtle §2.8)
- Prefix filtering: only prefixes actually used in the graph's IRIs are declared, following the practice of Apache Jena, Eclipse RDF4J, and Raptor

All triples from the source graph are preserved — the hybrid step only changes syntactic form, never semantic content. Plain string literals have their xsd:string datatype stripped per RDF 1.1 §2.5.1 (simple literals are syntactic sugar for xsd:string).
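The content-based relabeling in phase 2 can be illustrated with a minimal, standard-library sketch. It is not the PR's `_wl_signatures()` implementation: the triple representation (plain string tuples, blank nodes prefixed with `_:`), the function names, and the default of 4 rounds are assumptions made for illustration, and collision handling is deliberately left out, so structurally identical blank nodes would receive the same label here.

```python
import hashlib


def is_bnode(term):
    """Treat any term starting with '_:' as a blank node (illustrative convention)."""
    return term.startswith("_:")


def wl_labels(triples, rounds=4):
    """Map each blank node to a label derived only from its neighbourhood."""
    bnodes = {t for s, _, o in triples for t in (s, o) if is_bnode(t)}
    color = {b: "" for b in bnodes}
    for _ in range(rounds):
        new_color = {}
        for b in bnodes:
            parts = []
            for s, p, o in triples:
                if s == b:  # outgoing edge: predicate plus object (or its colour)
                    parts.append(("out", p, color[o] if is_bnode(o) else o))
                if o == b:  # incoming edge: predicate plus subject (or its colour)
                    parts.append(("in", p, color[s] if is_bnode(s) else s))
            parts.sort()
            new_color[b] = hashlib.sha256(repr(parts).encode("utf-8")).hexdigest()
        color = new_color
    return {b: "_:b" + color[b][:12] for b in bnodes}


def relabel(triples, rounds=4):
    """Return the triples with blank nodes renamed, in sorted order."""
    labels = wl_labels(triples, rounds)
    return sorted((labels.get(s, s), p, labels.get(o, o)) for s, p, o in triples)


# Two isomorphic graphs that differ only in blank-node numbering.
g1 = [
    ("ex:Shape", "sh:property", "_:c14n0"),
    ("_:c14n0", "sh:path", "ex:name"),
    ("ex:Shape", "sh:property", "_:c14n1"),
    ("_:c14n1", "sh:path", "ex:age"),
]
g2 = [
    ("ex:Shape", "sh:property", "_:c14n1"),
    ("_:c14n1", "sh:path", "ex:name"),
    ("ex:Shape", "sh:property", "_:c14n0"),
    ("_:c14n0", "sh:path", "ex:age"),
]
print(relabel(g1) == relabel(g2))
```

Because the labels are computed from predicates and neighbouring named nodes only, renumbering the input blank nodes leaves the relabeled triples unchanged. The nested loops make this sketch quadratic in the worst case; an adjacency index would be the obvious refinement.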

Additional Features

Collection sorting (gated behind --deterministic):

- owl:oneOf, sh:in, and sh:ignoredProperties items are sorted when the flag is set
- Existing behaviour is preserved by default

deterministic_json():

Recursive deep-sort for JSON-LD context output

Benchmark Results

Tested on the Gaia-X Trust Framework ontology (~68K OWL / ~165K SHACL triples) and schema.org (~18K triples):

Semantic Equivalence

| Artifact | Triples (det) | Triples (non-det) | `rdflib.compare.isomorphic()` |
| --- | --- | --- | --- |
| OWL | 68,178 | 68,178 | ✅ `True` |
| SHACL | 165,029 | 165,029 | ✅ `True` |
| schema.org | 17,949 | 17,949 | ✅ `True` |

Byte-Level Stability

| Test | Deterministic | Non-Deterministic |
| --- | --- | --- |
| SHA-256 identical across runs | ✅ | ❌ (~1,400 lines differ) |

Diff Quality (Signal-to-Noise Ratio)

Controlled mutations on a LinkML schema:

| Mutation | Generator | Deterministic | Non-Deterministic | Noise Reduction |
| --- | --- | --- | --- | --- |
| Change 1 description | OWL | 1 line | 344 lines | 344× |
| Change 1 description | SHACL | 13 lines | 290 lines | 22× |
| Add 1 new slot (+18 triples) | OWL | 16 lines | 350 lines | 22× |
| Add 1 new slot (+7 triples) | SHACL | 23 lines | 305 lines | 13× |
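The noise-reduction column is the ratio of non-deterministic to deterministic diff size. A rough way to reproduce that kind of measurement with the standard library is sketched below. The helper names are hypothetical, and the counting rule (removed plus added lines) is an assumption; the PR does not state exactly how its benchmark counts diff lines.

```python
import difflib


def changed_line_count(before: str, after: str) -> int:
    """Count lines removed from `before` plus lines added in `after`."""
    a, b = before.splitlines(), after.splitlines()
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
    changed = 0
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            changed += (i2 - i1) + (j2 - j1)
    return changed


def noise_reduction(deterministic_lines: int, non_deterministic_lines: int) -> float:
    """Ratio of noisy diff size to stable diff size."""
    return non_deterministic_lines / deterministic_lines


# One replaced line counts as one removal plus one addition.
print(changed_line_count("a\nb\nc", "a\nB\nc"))
# SHACL description change from the table above: 290 / 13
print(round(noise_reduction(13, 290)))
```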

Output Size (Gaia-X Trust Framework)

| Artifact | Before (pyoxigraph-only) | After (hybrid) | Change |
| --- | --- | --- | --- |
| OWL | 58,397 lines, 34 prefixes | 58,291 lines, 11 prefixes | -0.2% |
| SHACL | 163,993 lines, ~34 prefixes | 9,118 lines, ~9 prefixes | -94% |

The SHACL 18× size reduction comes from replacing 157,552 named _:bHASH blank nodes with inline [ … ] syntax and 77,358 explicit rdf:first/rdf:rest triples with ( … ) collection shorthand — matching the upstream Gaia-X registry convention.

Performance

| Graph | Triples | Time | Throughput |
| --- | --- | --- | --- |
| schema.org | 17,949 | 1.5s | ~12,000 triples/s |
| Gaia-X OWL | 68,178 | ~5s | ~14,000 triples/s |

Dependency

pyoxigraph >= 0.4.0 is imported lazily only when --deterministic is used. It is not a core dependency, avoiding conflict with morph-kgc's pin on pyoxigraph < 0.4.0. Tests skip gracefully when pyoxigraph >= 0.4.0 is unavailable.
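A lazy optional import of this kind is commonly written as a small helper that defers the import until the feature is requested and turns a missing package into an actionable error. The sketch below is illustrative only: `require_optional` and the hint text are hypothetical names, not code from this PR, and the version check for >= 0.4.0 is omitted.

```python
import importlib


def require_optional(module_name: str, install_hint: str):
    """Import a module on demand, raising an actionable ImportError if missing."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise ImportError(
            f"{module_name} is required for this feature. {install_hint}"
        ) from exc


def deterministic_feature_enabled(deterministic: bool) -> bool:
    """Only touch the optional dependency when the flag is actually set."""
    if not deterministic:
        return False
    require_optional("pyoxigraph", "Install it with: pip install 'pyoxigraph>=0.4.0'")
    return True


# With the flag off, the optional package is never imported.
print(deterministic_feature_enabled(False))
```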

Relationship to upstream linkml#3295

The original PR was closed after maintainer feedback requesting an established canonicalization standard. This PR:

- Replaces custom WL canonicalization with W3C RDFC-1.0 via pyoxigraph
- Retains WL only for blank node ID assignment (post-canonicalization remapping for diff stability, not a canonicalization algorithm)
- Adds hybrid rdflib re-serialization for idiomatic Turtle output (inline blank nodes, collection syntax, prefix filtering)
- Makes pyoxigraph an optional lazy import

Testing

- 37 tests (plus 4 network tests) covering idempotency, isomorphism, performance, diff quality, and edge cases
  - test_deterministic_output.py: 27 tests (stability, sorting, prefix format, enum ordering, kitchen_sink)
  - test_deterministic_benchmark.py: 10 local + 4 network tests (schema.org equivalence, mutation diff quality, signal-to-noise assertions)
- Tests skip cleanly when pyoxigraph >= 0.4.0 is not available
- All existing tests pass (9500+ across the full matrix)

Benchmark Test Assertions

The benchmark enforces quantitative properties:

- Deterministic diff ≤ 20 lines for a single description change
- Signal-to-noise ratio ≥ 5× (actual: 13–344×)
- Diff proportional to new triples (≤ 5× margin)
- SHA-256 byte-identical across 3 consecutive runs
- Every declared prefix has at least one IRI using it
- schema.org (17,949 triples) serializes in < 60s
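The byte-identity assertion amounts to hashing the generator output several times and requiring a single distinct digest. A minimal sketch follows, assuming the generator can be called as a zero-argument function that returns a string; `is_byte_stable` and the stand-in generators are hypothetical and not taken from the benchmark file.

```python
import hashlib
import itertools


def is_byte_stable(generate, runs=3):
    """Return True if `generate()` yields byte-identical output on every run."""
    digests = {
        hashlib.sha256(generate().encode("utf-8")).hexdigest() for _ in range(runs)
    }
    return len(digests) == 1


def stable_output():
    """Stand-in for a deterministic generator."""
    return "@prefix ex: <http://example.org/> .\n"


_counter = itertools.count()


def unstable_output():
    """Stand-in for a generator whose blank-node IDs change between runs."""
    return f"_:b{next(_counter)} a <http://example.org/Thing> .\n"


print(is_byte_stable(stable_output), is_byte_stable(unstable_output))
```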

References

W3C RDFC-1.0 — RDF Dataset Canonicalization
W3C Turtle 1.1 — Terse RDF Triple Language (§2.7 inline blank nodes, §2.8 collections)
Weisfeiler & Leman (1968) — graph canonical form
schema.org — benchmark ontology

jdsika · 2026-04-02T12:01:59Z

🔍 Adversarial Review — PR #1

Summary

A well-engineered feature with strong documentation and benchmark data. The three-phase pipeline (RDFC-1.0 → WL → rdflib) is architecturally sound but introduces significant complexity. I found 2 bugs (dead code shipped as functional features), 1 algorithmic concern in collision handling, and several design/test gaps worth addressing before merge.

🐛 Bugs & Issues

1. Dead code: normalize_prefixes field and CLI option do nothing

The normalize_prefixes: bool = False field is added to Generator (line ~429) and a --normalize-prefixes/--no-normalize-prefixes CLI option is registered — but self.normalize_prefixes is never read in any generator code in this PR. Users who pass --normalize-prefixes get silent no-op behavior. This should either be removed from this PR (it belongs in PR #4) or wired up.

# Added to Generator dataclass but never checked anywhere:
normalize_prefixes: bool = False

2. Dead code: well_known_prefix_map() defined but never called

The function is defined in generator.py but is not imported or called anywhere in this PR. It returns {str(ns): str(pfx) for pfx, ns in Graph().namespaces() if str(pfx)}, a dynamic map that changes across rdflib versions (see Concern #1). PR #4 redefines this with a static frozen map, creating a direct merge conflict.

3. WL collision counter assignment depends on c14n ordering, not structure

In _wl_signatures(), when two structurally identical blank nodes produce the same WL hash, the counter suffix (_0 vs _1) is assigned by iterating sorted(bnode_ids) — which sorts by canonical c14n ID (e.g., "c14n0" < "c14n42"):

for bid in sorted(bnode_ids):  # sorted by c14n ID, NOT by structure
    digest = hashlib.sha256(sig[bid].encode("utf-8")).hexdigest()[:12]
    count = seen_hashes.get(digest, 0)
    seen_hashes[digest] = count + 1
    label = f"b{digest}" if count == 0 else f"b{digest}_{count}"

Adding an unrelated triple can change RDFC-1.0 numbering, which changes which colliding node gets the base label vs _1 suffix. This defeats WL's core promise of diff-stable IDs in the (admittedly rare) collision case. Fix: use the full WL signature string as a secondary sort key before assigning counters:

for bid in sorted(bnode_ids, key=lambda b: (sig[b], b)):

⚠️ Concerns

1. well_known_prefix_map() is rdflib-version-dependent (direct conflict with PR #4)

Graph().namespaces() returns different prefix sets across rdflib 6.x vs 7.x (e.g., brick, csvw, geo were added/changed). This means "deterministic" output can change on dependency upgrade. PR #4 fixes this with _WELL_KNOWN_PREFIX_MAP: MappingProxyType — a static frozen map. Both PRs also add normalize_prefixes to Generator. These are two direct merge conflicts.

Recommendation: Remove well_known_prefix_map() and normalize_prefixes from this PR entirely. They're unused dead code here and belong exclusively in PR #4.

2. Sorting OWL expression members by repr is fragile

members = sorted(members, key=repr)

repr() for LinkML model objects depends on dataclass field ordering and internal representation. While stable within a single Python version, it can change across:

- Python versions (if dataclass repr format changes)
- linkml-runtime versions (if fields are reordered or types change)
- Any field containing objects with id()-based repr

A more robust approach: define an explicit sort key using stable, semantic fields (e.g., key=lambda x: (x.range or "", x.minimum_value or "")) or serialize to a canonical string.
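One possible shape for such an explicit key is sketched below. The `Expr` dataclass is a hypothetical stand-in for a LinkML expression object, and the choice of fields follows the example in the paragraph above rather than the actual generator code. Note that the inline lambda shown above would mix strings and numbers in one tuple position whenever `minimum_value` is numeric, which can raise `TypeError` during comparison, so the sketch separates the "is missing" flag from the value.

```python
from dataclasses import dataclass
from typing import Optional, Union


@dataclass
class Expr:
    """Hypothetical stand-in for an anonymous class/slot expression."""

    range: Optional[str] = None
    minimum_value: Optional[Union[int, float]] = None


def expr_sort_key(x: Expr):
    """Sort by semantic fields only; missing values sort after present ones."""
    return (
        x.range or "",
        x.minimum_value is None,
        x.minimum_value if x.minimum_value is not None else 0,
    )


def sort_members(members):
    return sorted(members, key=expr_sort_key)


a = Expr("integer", 5)
b = Expr("integer", None)
c = Expr("string", 0)
d = Expr(None, 1)
# The result does not depend on the input order.
print(sort_members([c, b, a, d]) == sort_members([a, d, c, b]))
```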

3. _mutate_kitchen_sink writes temp files to the test input directory

out_path = ks_path.parent / f"_benchmark_mutated_{os.getpid()}_kitchen_sink.yaml"

This writes to tests/linkml/test_generators/input/ which may be read-only in containerized CI. Also, yaml.dump(yaml.safe_load(...)) reformats the entire schema (quoting styles, flow/block, comment stripping), so the benchmark diffs measure YAML reformatting noise in addition to the intended mutation. Consider using tmpdir fixture or string manipulation instead of full YAML round-trip.
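A text-level mutation written to a temporary directory avoids both problems. The sketch below is one way to do it; the helper name, the output file name, and the sample schema text are hypothetical. It does not copy any imported schemas next to the mutated file, which a real test would still need to handle.

```python
import tempfile
from pathlib import Path


def write_mutated_copy(source_text: str, old: str, new: str, out_dir) -> Path:
    """Replace exactly one occurrence of `old` and write the result to `out_dir`."""
    if source_text.count(old) != 1:
        raise ValueError(f"expected exactly one occurrence of {old!r}")
    out_path = Path(out_dir) / "kitchen_sink_mutated.yaml"
    # Plain string replacement keeps quoting, comments, and layout untouched.
    out_path.write_text(source_text.replace(old, new), encoding="utf-8")
    return out_path


sample = "# schema comment\nname: demo\ndescription: old text\n"
with tempfile.TemporaryDirectory() as tmp:
    mutated = write_mutated_copy(sample, "old text", "new text", tmp)
    print(mutated.read_text(encoding="utf-8"))
```

In a pytest test, the `tmp_path` fixture can be passed as `out_dir` in place of `tempfile.TemporaryDirectory()`.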

4. No RDFC-1.0 timeout for pathological graphs

The W3C RDFC-1.0 spec acknowledges exponential worst-case complexity for certain graph topologies. pyoxigraph.Dataset.canonicalize() has no timeout parameter exposed. A signal.alarm() or thread-based timeout would protect CI from hangs on adversarial input.
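A thread-based guard could look roughly like the sketch below. It wraps an arbitrary callable rather than `pyoxigraph.Dataset.canonicalize()` itself, and the names are hypothetical. Two caveats apply: the worker thread is abandoned rather than killed, so the computation keeps running in the background, and the timeout can only fire if the wrapped call releases the GIL, which is an assumption that would need to be checked for the native canonicalization call.

```python
import threading
import time


def run_with_timeout(func, timeout_s, *args, **kwargs):
    """Run `func` in a daemon thread and raise TimeoutError if it takes too long."""
    result = {}

    def target():
        try:
            result["value"] = func(*args, **kwargs)
        except Exception as exc:  # re-raised in the caller's thread below
            result["error"] = exc

    worker = threading.Thread(target=target, daemon=True)
    worker.start()
    worker.join(timeout_s)
    if worker.is_alive():
        raise TimeoutError(f"call did not finish within {timeout_s}s")
    if "error" in result:
        raise result["error"]
    return result["value"]


def slow_stand_in(seconds):
    """Stand-in for a canonicalization call that takes a long time."""
    time.sleep(seconds)
    return "done"


print(run_with_timeout(slow_stand_in, 1.0, 0.01))
```

A subprocess-based variant would allow the hung work to be terminated, at the cost of serializing the graph across the process boundary.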

5. _deep_sort doesn't propagate parent_key into list items

sorted_items = [_deep_sort(item) for item in value]  # parent_key defaults to ""

If a list item is itself a list, it will be sorted regardless of whether the grandparent key is in _JSONLD_ORDERED_KEYS. Not a practical JSON-LD issue today, but a latent correctness gap.
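The gap can be closed by passing `parent_key` through when recursing into list items. The sketch below is a simplified reconstruction, not the PR's `_deep_sort`: the contents of the ordered-key set (`@list` only) and the use of a JSON dump as the list sort key are assumptions.

```python
import json

# Assumption: keys whose list values carry meaningful order.
ORDERED_KEYS = frozenset({"@list"})


def deep_sort(value, parent_key=""):
    """Recursively sort dict keys and unordered lists for stable output."""
    if isinstance(value, dict):
        return {k: deep_sort(value[k], k) for k in sorted(value)}
    if isinstance(value, list):
        # Propagate parent_key so nested lists inherit the ordered/unordered status.
        items = [deep_sort(item, parent_key) for item in value]
        if parent_key in ORDERED_KEYS:
            return items
        return sorted(items, key=lambda v: json.dumps(v, sort_keys=True))
    return value


print(deep_sort({"b": [3, 1, 2], "a": 1}))
print(deep_sort({"@list": [[2, 1], [0]]}))
```

With the propagation in place, the nested list under `@list` keeps its original order; with `parent_key` reset to `""` for list items, the inner `[2, 1]` would be reordered.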

6. O(n×m) prefix filtering

if pfx_s and any(iri.startswith(ns_s) for iri in used_iris):

With ~30 prefixes × thousands of IRIs, this check costs O(n×m) string comparisons. A trie or pre-sorted binary search would scale better for large ontologies (though the current 68K-triple benchmarks likely absorb this).
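The pre-sorted binary search variant can be sketched with `bisect`. In a sorted list, every IRI that starts with a given namespace sits in one contiguous run beginning at the insertion point of that namespace, so a single lookup per prefix is enough. The function name and the sample data are hypothetical; the empty-prefix guard mirrors the `pfx_s and ...` condition in the snippet above.

```python
from bisect import bisect_left


def used_prefixes(prefix_map, used_iris):
    """Return the entries of prefix_map whose namespace prefixes at least one IRI."""
    sorted_iris = sorted(set(used_iris))
    kept = {}
    for prefix, namespace in prefix_map.items():
        if not prefix:
            continue  # skip the empty prefix, as the original guard does
        i = bisect_left(sorted_iris, namespace)
        if i < len(sorted_iris) and sorted_iris[i].startswith(namespace):
            kept[prefix] = namespace
    return kept


prefixes = {
    "ex": "http://example.org/",
    "foaf": "http://xmlns.com/foaf/0.1/",
    "sh": "http://www.w3.org/ns/shacl#",
}
iris = ["http://www.w3.org/ns/shacl#path", "http://example.org/Person"]
print(sorted(used_prefixes(prefixes, iris)))
```

Sorting costs O(m log m) once, and each prefix lookup is O(log m), compared with a scan over all IRIs per prefix.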

7. xsd:string stripping is correct but lossy

The code strips explicit xsd:string datatype annotations during pyoxigraph→rdflib conversion. This is correct per RDF 1.1 §2.5.1 (simple literals = xsd:string), but users who intentionally annotated xsd:string for tooling compatibility lose the annotation after round-trip. Worth documenting this in the docstring.

🧪 Test Coverage Assessment

Strong coverage (✅):

- Byte-level idempotency across runs (4 generators)
- Sorted key verification for JSON generators
- Prefix format (@prefix vs PREFIX)
- Large schema stability (kitchen_sink)
- xfail documentation of non-isomorphism trade-off
- Non-deterministic mode regression guard

Gaps (❌):

- No test for WL hash collisions: Need a graph with two structurally identical blank node subgraphs to verify counter-based dedup produces stable output
- No test for deep blank node nesting (>4 levels): 4 WL iterations may be insufficient for chains of 5+ structurally similar nodes
- No test for well_known_prefix_map(): Dead code, but if kept, needs coverage
- No test for normalize_prefixes: Dead code; the CLI option silently does nothing
- xfail tests document non-isomorphism but don't verify semantic equivalence: The assertion "OWL/SHACL interpret these as unordered sets" should be backed by a test that verifies the same classes/constraints exist in both modes (e.g., compare extracted sh:in value sets, not just triple counts)
- Benchmark YAML round-trip may inflate diff counts: yaml.dump(yaml.safe_load(original)) reformats the entire file, so diff measurements may include YAML formatting noise, not just the intended mutation
- Network tests (@pytest.mark.network): schema.org download can fail in CI. No conftest.py marker filtering is shown, so these may run and flake by default
- Performance test (10s limit): Fragile on slow CI runners; consider a more generous threshold or @pytest.mark.slow marker

📋 Fix Plan

1. Remove normalize_prefixes and well_known_prefix_map() from this PR: they are dead code, belong in PR #4 (fix(generators): add --normalize-prefixes flag for well-known prefix names), and create merge conflicts
2. Fix WL collision counter ordering: Sort by (sig[bid], bid) instead of just bid to make counter assignment structure-dependent
3. Add a WL collision test: Create a graph with two identical blank node subgraphs, verify both get stable IDs across runs
4. Replace repr sorting with explicit key function for OWL expression members
5. Use tmp_path fixture for _mutate_kitchen_sink output (create a symlink or copy imports alongside)
6. Add @pytest.mark.slow to performance tests and document CI threshold expectations
7. Document xsd:string stripping in deterministic_turtle() docstring explicitly as a known behavior

✅ What's Good

- Excellent documentation: Thorough docstrings with W3C spec references, clear parameter docs, and well-written PR description with benchmark tables
- Lazy import pattern: pyoxigraph is only loaded when --deterministic is used; graceful ImportError with actionable message
- Defensive test design: pytestmark = pytest.mark.skipif(not _has_pyoxigraph, ...) and fixture-level pytest.skip() for network tests
- The hybrid pipeline is clever: RDFC-1.0 for correctness, WL for diff stability, rdflib for Turtle readability; each phase has a clear purpose
- The 94% SHACL size reduction (inline blank nodes + collection syntax) is a compelling result
- xfail tests properly document trade-offs rather than hiding them
- Prefix filtering removes the ~27 rdflib default bindings that leak into output, a real quality-of-life improvement

@context

…lib serialization

Add a --deterministic / --no-deterministic CLI flag (default off) to OWL, SHACL, JSON-LD Context, and JSON-LD generators that produces byte-identical output across invocations.

Three-phase hybrid pipeline for Turtle generators:
1. RDFC-1.0 canonicalization (W3C Recommendation) via pyoxigraph
2. Weisfeiler-Lehman structural hashing for diff-stable blank node IDs
3. Hybrid rdflib re-serialization for idiomatic Turtle (inline blank nodes, collection syntax, prefix filtering)

JSON generators use deterministic_json() with recursive deep-sort and JSON-LD-aware key ordering that preserves conventional @context structure.

Collection items (owl:oneOf, sh:in, sh:ignoredProperties) are sorted when --deterministic is set to ensure reproducible RDF list order.

pyoxigraph >= 0.4.0 is imported lazily only when --deterministic is used. Tests skip gracefully when pyoxigraph is unavailable.

Refs: linkml#1847
Signed-off-by: Carlo van Driesten <carlo.van-driesten@bmw.de>
Signed-off-by: jdsika <carlo.van-driesten@bmw.de>

jdsika force-pushed the feat/deterministic-output branch from bdb0f7a to 6544b72 Compare March 25, 2026 16:03

jdsika self-assigned this Mar 25, 2026

jdsika force-pushed the main branch from 0a15ba8 to a8fdeb0 Compare March 25, 2026 16:23

jdsika force-pushed the feat/deterministic-output branch 7 times, most recently from f0a081a to 7f529d6 Compare March 28, 2026 14:23

jdsika mentioned this pull request Mar 28, 2026

Possibility for deterministic order of multivalue slots in RDF output? linkml/linkml#1943

Open

jdsika force-pushed the feat/deterministic-output branch from 7f529d6 to 37cafc8 Compare March 29, 2026 19:34

jdsika changed the title ~~feat(generators): add --deterministic flag for reproducible output (pyoxigraph RDFC-1.0)~~ feat(generators): add --deterministic flag with hybrid RDFC-1.0 + rdflib serialization Mar 30, 2026

jdsika force-pushed the feat/deterministic-output branch 3 times, most recently from fb47790 to 8016a4b Compare April 2, 2026 09:34

jdsika mentioned this pull request Apr 2, 2026

fix(generators): add --normalize-prefixes flag for well-known prefix names #4

Open

jdsika force-pushed the feat/deterministic-output branch from d9c1a07 to 5da3f77 Compare April 2, 2026 13:01

jdsika force-pushed the main branch from 4c9ba04 to 0fa6f93 Compare April 2, 2026 14:45

jdsika force-pushed the feat/deterministic-output branch from 5da3f77 to cfaba19 Compare April 2, 2026 14:57

jdsika force-pushed the feat/deterministic-output branch from cfaba19 to c4ecf10 Compare April 2, 2026 15:18

jdsika wants to merge 1 commit into main from feat/deterministic-output
